Data Formats

GESIS Workshop: Introduction to Geospatial Techniques for Social Scientists in R

Stefan Jünger & Dennis Abel

2025-04-09

Day Time Title
April 09 10:00-11:30 Introduction
April 09 11:30-11:45 Coffee Break
April 09 11:45-13:00 Data Formats
April 09 13:00-14:00 Lunch Break
April 09 14:00-15:30 Mapping I
April 09 15:30-15:45 Coffee Break
April 09 15:45-17:00 Spatial Wrangling
April 10 09:00-10:30 Mapping II
April 10 10:30-10:45 Coffee Break
April 10 10:45-12:00 Applied Spatial Linking
April 10 12:00-13:00 Lunch Break
April 10 13:00-14:30 Spatial Autocorrelation
April 10 14:30-14:45 Coffee Break
April 10 14:45-16:00 Spatial Econometrics & Outlook

Why care about data types and formats?

There are differences in how spatial information is stored, processed, and visually represented.

  • Different commands for data import and manipulation
  • Spatial linking techniques and analyses partly determined by data format
  • Visualization of data can vary

So, always know what kind of data you are dealing with!

Vector and raster data

Sources: OpenStreetMap / GEOFABRIK (2018), City of Cologne (2014), and the Statistical Offices of the Federation and the Länder (2016) / Jünger, 2019

1. Vector data

Representing the world in vectors

The surface of the earth is represented by simple geometries and attributes.

Each object is defined by longitude (x) and latitude (y) values.

It could also include z coordinates…

Vector data: geometries

Every real-world feature is one of three types of geometry:

  • Points: discrete location (e.g., a tree)
  • Lines: linear feature (e.g., a river)
  • Polygons: enclosed areas (e.g., a city, country, administrative boundaries)

National Ecological Observatory Network (NEON), cited by Datacarpentry

Vector data: attribute tables

Only geometries means that we do not have any other information.

We must assign attributes to each geometry to hold additional information \(\rightarrow\) data tables called attribute tables.

  • Each row represents a geometric object, which we can also call observation, feature, or case
  • Each column holds an attribute or, in ‘our’ language, a variable

Vector data: attribute tables

File formats/extensions

  • GeoPackage .gpkg
  • Shapefile .shp
  • GeoJSON .geojson
  • Sometimes, vector data come even in a text format, such as CSV

Welcome to the proprietary reality: shapefiles

Both the geometric information and attribute table can be saved within one file. Rather often, ESRI Shapefiles are used to store vector data. Shapefiles consist of at least three mandatory files with the following extensions:

  • .shp: shape format
  • .shx: shape index format
  • .dbf: attribute format
  • (.prj: CRS/Projection)

You don’t have to remember what they stand for, but you can only properly work with the data if none of those files is missing.

Welcome to simple features

Several packages are out there to wrangle and visualize spatial and, especially, vector data within R. We will use the sf package (“simple features”).


Why?

simple features refers to a formal standard representing spatial geometries and supports interfaces to other programming languages and GIS systems (ISO 19125-1).

Illustration by Allison Horst

Load a shapefile

The first step is, of course, loading the data. We want to import the shapefile for the administrative borders of the German states (Bundesländer) called VG250_LAN.shp.

# load library
library(sf)

# load data
german_states <- sf::read_sf("./data/VG250_LAN.shp")

What is this thing?

german_states
Simple feature collection with 35 features and 23 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6106244
Projected CRS: ETRS89 / UTM zone 32N
# A tibble: 35 × 24
     ADE    GF   BSG ARS   AGS   SDV_ARS     GEN   BEZ     IBZ BEM   NBD   SN_L 
   <int> <int> <int> <chr> <chr> <chr>       <chr> <chr> <int> <chr> <chr> <chr>
 1     2     4     1 01    01    0100200000… Schl… Land     20 --    ja    01   
 2     2     4     1 02    02    0200000000… Hamb… Frei…    22 --    ja    02   
 3     2     4     1 03    03    0324100010… Nied… Land     20 --    ja    03   
 4     2     4     1 04    04    0401100000… Brem… Frei…    23 --    ja    04   
 5     2     4     1 05    05    0511100000… Nord… Land     20 --    ja    05   
 6     2     4     1 06    06    0641400000… Hess… Land     20 --    ja    06   
 7     2     4     1 07    07    0731500000… Rhei… Land     20 --    ja    07   
 8     2     4     1 08    08    0811100000… Bade… Land     20 --    ja    08   
 9     2     4     1 09    09    0916200000… Baye… Frei…    21 --    ja    09   
10     2     4     1 10    10    1004101001… Saar… Land     20 --    nein  10   
# ℹ 25 more rows
# ℹ 12 more variables: SN_R <chr>, SN_K <chr>, SN_V1 <chr>, SN_V2 <chr>,
#   SN_G <chr>, FK_S3 <chr>, NUTS <chr>, ARS_0 <chr>, AGS_0 <chr>, WSK <date>,
#   DEBKG_ID <chr>, geometry <MULTIPOLYGON [m]>

We can already plot it

plot(german_states["SN_L"])

This is the bounding box

Inspect your data: classics

Let’s have a quick look at the imported data. Like with every other data set, we can inspect the data to check some metadata and see if the importing worked correctly.

# object type
class(german_states) 
[1] "sf"         "tbl_df"     "tbl"        "data.frame"
# number of rows
nrow(german_states)
[1] 35
# number of columns
ncol(german_states)
[1] 24

Inspect your data: classics

There are no huge differences between the shapefile we just imported and a regular data table.

# head of data table
head(german_states, 2)
Simple feature collection with 2 features and 23 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 426205.6 ymin: 5913462 xmax: 650128.7 ymax: 6101487
Projected CRS: ETRS89 / UTM zone 32N
# A tibble: 2 × 24
    ADE    GF   BSG ARS   AGS   SDV_ARS      GEN   BEZ     IBZ BEM   NBD   SN_L 
  <int> <int> <int> <chr> <chr> <chr>        <chr> <chr> <int> <chr> <chr> <chr>
1     2     4     1 01    01    010020000000 Schl… Land     20 --    ja    01   
2     2     4     1 02    02    020000000000 Hamb… Frei…    22 --    ja    02   
# ℹ 12 more variables: SN_R <chr>, SN_K <chr>, SN_V1 <chr>, SN_V2 <chr>,
#   SN_G <chr>, FK_S3 <chr>, NUTS <chr>, ARS_0 <chr>, AGS_0 <chr>, WSK <date>,
#   DEBKG_ID <chr>, geometry <MULTIPOLYGON [m]>

Inspect your data: spatial features

Besides our general data inspection, we may also want to check the spatial features of our import. This check includes the geometric type (points? lines? polygons?) and the coordinate reference system.

# type of geometry
sf::st_geometry(german_states) 
Geometry set for 35 features 
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 280371.1 ymin: 5235856 xmax: 921292.4 ymax: 6106244
Projected CRS: ETRS89 / UTM zone 32N
First 5 geometries:

Inspect your data: spatial features

Each polygon is defined by several connected points to build an enclosed area. Several polygons in one data frame have the sf type multipolygons. Just as Germany consists of several states, the polygon Germany consists of several smaller polygons.

# extract the simple features column and further inspecting 
attr(german_states, "sf_column") |> 
  dplyr::pull(german_states, var = _) |> 
  dplyr::glimpse()
sfc_MULTIPOLYGON of length 35; first list element: List of 28
 $ :List of 1
  ..$ : num [1:11707, 1:2] 464811 464937 465073 465235 465354 ...
 $ :List of 1
  ..$ : num [1:757, 1:2] 634993 635176 635330 635473 635552 ...
 $ :List of 1
  ..$ : num [1:306, 1:2] 471895 472061 472272 472312 472384 ...
 $ :List of 1
  ..$ : num [1:190, 1:2] 480925 480903 481160 481210 481281 ...
 $ :List of 1
  ..$ : num [1:230, 1:2] 457616 457561 457500 457464 457503 ...
 $ :List of 1
  ..$ : num [1:234, 1:2] 477852 477719 477605 477561 477535 ...
 $ :List of 1
  ..$ : num [1:117, 1:2] 468857 468880 468943 469077 469224 ...
 $ :List of 1
  ..$ : num [1:105, 1:2] 534788 534808 534478 534402 534334 ...
 $ :List of 1
  ..$ : num [1:123, 1:2] 535979 536027 536039 536060 536084 ...
 $ :List of 1
  ..$ : num [1:68, 1:2] 483056 483131 482991 482989 482964 ...
 $ :List of 1
  ..$ : num [1:98, 1:2] 479878 479947 479994 480009 480007 ...
 $ :List of 1
  ..$ : num [1:68, 1:2] 488077 488141 488316 488413 488445 ...
 $ :List of 1
  ..$ : num [1:39, 1:2] 527492 527482 527421 527332 527237 ...
 $ :List of 1
  ..$ : num [1:60, 1:2] 480594 480694 480760 481193 481251 ...
 $ :List of 1
  ..$ : num [1:59, 1:2] 427174 427275 427432 427585 427458 ...
 $ :List of 1
  ..$ : num [1:43, 1:2] 429524 429512 429537 429550 429563 ...
 $ :List of 1
  ..$ : num [1:35, 1:2] 488694 488986 489140 489227 489050 ...
 $ :List of 1
  ..$ : num [1:38, 1:2] 471007 470960 470930 470874 470840 ...
 $ :List of 1
  ..$ : num [1:42, 1:2] 482577 482657 482684 482774 482823 ...
 $ :List of 1
  ..$ : num [1:39, 1:2] 536845 536814 536782 536747 536704 ...
 $ :List of 1
  ..$ : num [1:34, 1:2] 485251 485277 485279 485261 485235 ...
 $ :List of 1
  ..$ : num [1:28, 1:2] 635056 635121 635182 635220 635266 ...
 $ :List of 1
  ..$ : num [1:23, 1:2] 468313 468243 468176 468114 468108 ...
 $ :List of 1
  ..$ : num [1:14, 1:2] 547663 547739 547807 547906 548038 ...
 $ :List of 1
  ..$ : num [1:34, 1:2] 459204 459240 459271 459299 459325 ...
 $ :List of 1
  ..$ : num [1:33, 1:2] 536538 536569 536602 536621 536612 ...
 $ :List of 1
  ..$ : num [1:15, 1:2] 643372 643346 643314 643289 643245 ...
 $ :List of 1
  ..$ : num [1:7, 1:2] 546233 546342 545809 545894 546030 ...
 - attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"

Inspect your data: spatial features

Remember: The Coordinate Reference System is critical. A crucial step is to check the CRS of your geospatial data.

# coordinate reference system
sf::st_crs(german_states) 
Coordinate Reference System:
  User input: ETRS89 / UTM zone 32N 
  wkt:
PROJCRS["ETRS89 / UTM zone 32N",
    BASEGEOGCRS["ETRS89",
        ENSEMBLE["European Terrestrial Reference System 1989 ensemble",
            MEMBER["European Terrestrial Reference Frame 1989"],
            MEMBER["European Terrestrial Reference Frame 1990"],
            MEMBER["European Terrestrial Reference Frame 1991"],
            MEMBER["European Terrestrial Reference Frame 1992"],
            MEMBER["European Terrestrial Reference Frame 1993"],
            MEMBER["European Terrestrial Reference Frame 1994"],
            MEMBER["European Terrestrial Reference Frame 1996"],
            MEMBER["European Terrestrial Reference Frame 1997"],
            MEMBER["European Terrestrial Reference Frame 2000"],
            MEMBER["European Terrestrial Reference Frame 2005"],
            MEMBER["European Terrestrial Reference Frame 2014"],
            MEMBER["European Terrestrial Reference Frame 2020"],
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[0.1]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4258]],
    CONVERSION["UTM zone 32N",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",9,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.9996,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",500000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["Europe between 6°E and 12°E: Austria; Belgium; Denmark - onshore and offshore; Germany - onshore and offshore; Norway including - onshore and offshore; Spain - offshore."],
        BBOX[38.76,6,84.33,12.01]],
    ID["EPSG",25832]]

Exercise 2_1: Import Vector Data

Exercise

2. Raster data

Difference to vector data

Data Structure:

  • Other data format(s), different file extensions
  • Geometries do not differ within one dataset

Implications:

  • Other geospatial operations possible

Benefits:

  • Can be way more efficient and straightforward to process
  • It’s like working with simple tabular data

Visual difference between vector and raster data

What exactly are raster data?

  • Hold information on (most of the time) evenly shaped grid cells
  • Basically, a simple data table
  • Each cell represents one observation

Metadata

  • Information about geometries is globally stored
  • They are the same for all observations
  • Their location in space is defined by their cell location in the data table
  • Without this information, raster data were simple image files

Important metadata

Raster Dimensions: number of columns, rows, and cells

Extent: similar to bounding box in vector data

Resolution: the size of each raster cell

Coordinate reference system: defines where on the earth’s surface the raster layer lies

Setting up a raster dataset is easy

input_data <- 
  sample(1:100, 16) |> 
  matrix(nrow = 4)

raster_layer <- terra::rast(input_data)

raster_layer
class       : SpatRaster 
dimensions  : 4, 4, 1  (nrow, ncol, nlyr)
resolution  : 1, 1  (x, y)
extent      : 0, 4, 0, 4  (xmin, xmax, ymin, ymax)
coord. ref. :  
source(s)   : memory
name        : lyr.1 
min value   :    24 
max value   :    93 

We can already plot it

terra::plot(raster_layer)

File formats/extensions

  • GeoTIFF tif
  • Gridded data .grd
  • Network common data format .nc
  • Esri grid .asc
  • Sometimes, raster data come even in a text format, such as CSV

In this course, we will only use tiff files as they are pretty common. Just be aware that there are different formats out there.

Implementations in R

terra is the most commonly used package for raster data in R.

Some other developments, e.g., in the stars package, also implement an interface to simple features in sf.

The terra package helps to use more elaborate zonal statistics. But the spatstat package is more elaborated.

Loading raster tiffs (census data)

immigrants_cologne <- terra::rast("./data/immigrants_cologne.tif")

inhabitants_cologne <- terra::rast("./data/inhabitants_cologne.tif")

immigrants_cologne
inhabitants_cologne
class       : SpatRaster 
dimensions  : 289, 264, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094850, 4121250, 3084050, 3112950  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (with axis order normalized for visualization) 
source      : immigrants_cologne.tif 
name        : immigrants_cologne 
min value   :                  3 
max value   :                639 
class       : SpatRaster 
dimensions  : 289, 264, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094850, 4121250, 3084050, 3112950  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (with axis order normalized for visualization) 
source      : inhabitants_cologne.tif 
name        : inhabitants_cologne 
min value   :                   3 
max value   :                 956 

Compare layers by plotting

terra::plot(immigrants_cologne)

terra::plot(inhabitants_cologne)

Preparing for analysis / base R functionalities

immigrants_cologne[immigrants_cologne == -9] <- NA
inhabitants_cologne[inhabitants_cologne == -9] <- NA

immigrants_cologne
class       : SpatRaster 
dimensions  : 289, 264, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094850, 4121250, 3084050, 3112950  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (with axis order normalized for visualization) 
source(s)   : memory
varname     : immigrants_cologne 
name        : immigrants_cologne 
min value   :                  3 
max value   :                639 
inhabitants_cologne
class       : SpatRaster 
dimensions  : 289, 264, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094850, 4121250, 3084050, 3112950  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (with axis order normalized for visualization) 
source(s)   : memory
varname     : inhabitants_cologne 
name        : inhabitants_cologne 
min value   :                   3 
max value   :                 956 

Simple statistics

Working with raster data is straightforward

  • quite speedy
  • yet not as comfortable as working with sf objects

For example, to calculate the mean, we would use the following:

terra::global(immigrants_cologne, fun = "mean", na.rm = TRUE)
                      mean
immigrants_cologne 15.1337

Get all values as a vector

We can also extract the values of a raster directly as a vector:

all_raster_values <- terra::values(immigrants_cologne)

mean(all_raster_values, na.rm = TRUE)
[1] 15.1337

Nevertheless, although raster data are simple data tables, working with them is a bit different compared to, e.g., simple features.

Combining raster layers to calculate new values

immigrant_rate <-
  immigrants_cologne * 100 / 
  inhabitants_cologne

immigrant_rate
class       : SpatRaster 
dimensions  : 289, 264, 1  (nrow, ncol, nlyr)
resolution  : 100, 100  (x, y)
extent      : 4094850, 4121250, 3084050, 3112950  (xmin, xmax, ymin, ymax)
coord. ref. : ETRS89-extended / LAEA Europe (with axis order normalized for visualization) 
source(s)   : memory
varname     : immigrants_cologne 
name        : immigrants_cologne 
min value   :          0.6637168 
max value   :        100.0000000 

Exercise 2_2: Basic Raster Operations

Exercise